Overview
Fine-tuning allows you to adapt pretrained CLIP models to specific domains or datasets by continuing training from a pretrained checkpoint. This is often more efficient than training from scratch and can achieve better performance with less data.
When to Fine-tune
✅ Fine-tune when:
You have a pretrained model that’s close to your target domain
You have limited training data (1M-100M samples)
You want to adapt to a specific domain (medical images, satellite imagery, etc.)
You need faster convergence than training from scratch
You want to improve zero-shot performance on specific tasks
❌ Train from scratch when:
Your domain is very different from the pretrained model’s training data
You have massive amounts of training data (>1B samples)
You need a completely custom architecture
You want to experiment with new training objectives
Loading Pretrained Weights
From OpenCLIP Pretrained Models
Use the --pretrained flag with a model tag:
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--train-data "/data/custom_dataset.tar" \
--lr 1e-5 \
--epochs 10
Available pretrained tags:
import open_clip

# List all (model name, pretrained tag) pairs
for model_name, pretrained in open_clip.list_pretrained():
    print(f"{model_name}: {pretrained}")
Common pretrained weights:
laion2b_s34b_b79k: ViT-B/32 on LAION-2B
laion2b_s32b_b82k: ViT-L/14 on LAION-2B
openai: Original OpenAI CLIP weights
datacomp_xl_s13b_b90k: DataComp-1B models
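The tag names above appear to follow a naming pattern in which `s34b` stands for roughly 34 billion samples seen and `b79k` for a global batch size of about 79 thousand. The sketch below decodes tags under that assumption. `parse_pretrained_tag` is a hypothetical helper and not part of open_clip; tags that do not match the pattern (such as `openai`) are returned with empty fields.

```python
import re

# Assumed pattern: <dataset>_s<samples seen>_b<global batch size>,
# e.g. "laion2b_s34b_b79k". This is a naming convention, not an API guarantee.
_TAG_RE = re.compile(
    r"^(?P<dataset>.+)_s(?P<samples>\d+(?:\.\d+)?[kmb])_b(?P<batch>\d+(?:\.\d+)?[kmb]?)$"
)
_SUFFIX = {"k": 1e3, "m": 1e6, "b": 1e9, "": 1.0}


def _to_number(text):
    """Convert strings such as '34b' or '79k' to a float."""
    suffix = text[-1] if text[-1] in "kmb" else ""
    digits = text[:-1] if suffix else text
    return float(digits) * _SUFFIX[suffix]


def parse_pretrained_tag(tag):
    """Hypothetical helper: split a tag into dataset, samples seen, batch size."""
    match = _TAG_RE.match(tag)
    if match is None:
        # Tags such as "openai" carry no encoded training details.
        return {"dataset": tag, "samples_seen": None, "batch_size": None}
    return {
        "dataset": match["dataset"],
        "samples_seen": _to_number(match["samples"]),
        "batch_size": _to_number(match["batch"]),
    }


for tag in ["laion2b_s34b_b79k", "datacomp_xl_s13b_b90k", "openai"]:
    print(tag, parse_pretrained_tag(tag))
```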
From Local Checkpoint
Use a local checkpoint file:
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained /path/to/checkpoint.pt \
--train-data "/data/custom_dataset.tar" \
--lr 1e-5 \
--epochs 10
From Hugging Face Hub
Download the checkpoint from Hugging Face, then pass its local path to --pretrained:
# Download from HF
wget https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/open_clip_pytorch_model.bin
# Use in training
python -m open_clip_train.main \
--model ViT-L-14 \
--pretrained /path/to/open_clip_pytorch_model.bin \
# ... other arguments
Resuming Training from Checkpoint
The --resume flag continues training from a checkpoint, including optimizer state:
python -m open_clip_train.main \
--train-data "/path/to/train_data.csv" \
--val-data "/path/to/validation_data.csv" \
--resume /path/to/checkpoints/epoch_K.pt \
--model ViT-B-32 \
# ... other arguments should match original training
Resume vs Pretrained:
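| Flag | Use Case | Loads Optimizer | Loads Epoch | Learning Rate |
|------|----------|-----------------|-------------|---------------|
| --resume | Continue interrupted training | ✅ Yes | ✅ Yes | Original schedule continues |
| --pretrained | Fine-tune from pretrained | ❌ No | ❌ No | New schedule from epoch 0 |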
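The difference between the two flags can be sketched in a few lines of Python. This is a simplified illustration, not the training script's actual loading code: the checkpoint is a plain dict, and the keys `state_dict`, `optimizer`, and `epoch` as well as the helper name `restore_state` are assumptions made for the example.

```python
def restore_state(checkpoint, mode):
    """Return what a run would restore from `checkpoint` in each mode.

    `checkpoint` is a plain dict standing in for a loaded training checkpoint;
    its keys are assumptions for this sketch.
    """
    if mode == "resume":
        # Continue the interrupted run: weights, optimizer state, and epoch.
        return {
            "weights": checkpoint["state_dict"],
            "optimizer": checkpoint.get("optimizer"),
            "start_epoch": checkpoint.get("epoch", 0),
        }
    if mode == "pretrained":
        # Start a new run: weights only, fresh optimizer, schedule from epoch 0.
        return {
            "weights": checkpoint["state_dict"],
            "optimizer": None,
            "start_epoch": 0,
        }
    raise ValueError(f"unknown mode: {mode!r}")


checkpoint = {"state_dict": {"w": 1.0}, "optimizer": {"step": 5000}, "epoch": 7}
print(restore_state(checkpoint, "resume"))
print(restore_state(checkpoint, "pretrained"))
```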
Resume from Latest Checkpoint
python -m open_clip_train.main \
--resume latest \
# ... other arguments
With --resume latest, training automatically finds and loads the most recent checkpoint in the logs directory.
Fine-tuning Strategies
1. Full Model Fine-tuning
Fine-tune all parameters with a lower learning rate:
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--train-data "/data/domain_specific.tar" \
--train-num-samples 10000000 \
--dataset-type webdataset \
--lr 1e-5 \
--warmup 1000 \
--epochs 10 \
--batch-size 256 \
--precision amp \
--workers 8
Key changes from pretraining:
⬇️ Lower learning rate: 1e-5 vs 1e-3 for pretraining
⏱️ Fewer epochs: 10 vs 32 for pretraining
🔥 Shorter warmup: 1000 vs 10000 steps
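To judge whether a warmup length is reasonable, it helps to translate samples and epochs into optimizer steps. The sketch below does that arithmetic for the command above, assuming that `--batch-size` is the per-GPU batch size and that each optimizer step consumes one global batch (no gradient accumulation). `schedule_summary` is a hypothetical helper name.

```python
def schedule_summary(num_samples, per_gpu_batch_size, world_size, epochs, warmup_steps):
    """Summarize optimizer steps for a run, assuming one step per global batch."""
    global_batch = per_gpu_batch_size * world_size
    steps_per_epoch = num_samples // global_batch
    total_steps = steps_per_epoch * epochs
    return {
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# Values from the full fine-tuning command: 10M samples, batch size 256,
# a single process, 10 epochs, 1000 warmup steps.
summary = schedule_summary(10_000_000, 256, 1, 10, 1000)
print(summary)
```

With these values the run has 39,062 steps per epoch and 390,620 steps in total, so 1000 warmup steps cover well under 1% of training.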
2. Frozen Image Encoder (Text-Only Fine-tuning)
Freeze the image encoder and only fine-tune the text encoder:
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--lock-image \
--train-data "/data/domain_specific.tar" \
--lr 1e-4 \
--epochs 10
Benefits:
💾 Lower memory usage
⚡ Faster training
🎯 Useful when adapting to new vocabulary/concepts
3. Frozen Text Encoder (Image-Only Fine-tuning)
Freeze the text encoder and only fine-tune the image encoder:
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--lock-text \
--train-data "/data/domain_specific.tar" \
--lr 1e-4 \
--epochs 10
Use cases:
Adapting to new image domains (medical, satellite, etc.)
Maintaining text understanding while improving visual features
4. Partial Fine-tuning
Freeze early layers and fine-tune later layers:
# Fine-tune last 2 groups of image encoder
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--lock-image \
--lock-image-unlocked-groups 2 \
--train-data "/data/domain_specific.tar" \
--lr 5e-5 \
--epochs 10
# Fine-tune last 10 layers of text encoder
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--lock-text \
--lock-text-unlocked-layers 10 \
--train-data "/data/domain_specific.tar" \
--lr 5e-5 \
--epochs 10
Benefits:
⚖️ Balance between adaptation and preservation
💾 Lower memory and compute requirements
🛡️ Less prone to overfitting on small datasets
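The idea behind unlocking the last N groups can be sketched without any deep learning framework. The grouping used here is an assumption for illustration only: embeddings as the first group, each transformer block as its own group, and the output projection as the last group. The real grouping depends on the model implementation, so check how your chosen architecture defines its groups before relying on a specific count.

```python
def build_groups(num_blocks):
    """Hypothetical ordering of parameter groups for a ViT image tower, input to output."""
    return ["embeddings"] + [f"block_{i}" for i in range(num_blocks)] + ["projection"]


def trainable_groups(groups, unlocked_groups):
    """Freeze everything, then unlock only the last `unlocked_groups` groups."""
    if unlocked_groups <= 0:
        return []  # fully frozen tower
    return groups[-unlocked_groups:]


groups = build_groups(num_blocks=12)
print(trainable_groups(groups, 2))  # ['block_11', 'projection']
```

Everything not returned by `trainable_groups` would keep its pretrained weights, which is why this strategy preserves low-level features while letting the last layers adapt.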
5. LiT (Locked Image Tuning)
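Lock the image encoder, initialized with ImageNet-pretrained weights, and train the text encoder from scratch. Note that --pretrained-image loads ImageNet weights only where they are available for the chosen image tower, so check that your --model supports it: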
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained-image \
--lock-image \
--lock-image-freeze-bn-stats \
--train-data "/data/train.tar" \
--lr 1e-3 \
--epochs 32
Reference: LiT: Zero-Shot Transfer with Locked-image Text Tuning
Learning Rate Adjustment
Fine-tuning requires careful learning rate selection:
Recommended Learning Rates
| Strategy | Learning Rate | Warmup Steps | Epochs |
|----------|---------------|--------------|--------|
| Full fine-tuning | 1e-5 to 1e-4 | 1000-5000 | 5-10 |
| Partial fine-tuning | 5e-5 to 5e-4 | 1000-3000 | 5-10 |
| Frozen encoder | 1e-4 to 1e-3 | 1000-5000 | 10-20 |
| From scratch (reference) | 1e-3 to 5e-3 | 10000 | 32+ |
Learning Rate Schedules
Cosine with warmup (recommended):
--lr 1e-5 \
--warmup 1000 \
--lr-scheduler cosine \
--epochs 10
Constant with warmup:
--lr 1e-5 \
--warmup 1000 \
--lr-scheduler const \
--epochs 10
Constant with cooldown:
--lr 1e-4 \
--warmup 1000 \
--lr-scheduler const-cooldown \
--epochs-cooldown 2 \
--lr-cooldown-end 1e-6 \
--epochs 10
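The three schedules differ only in what happens after warmup. The following is a minimal sketch of common formulations of each, written per optimizer step; it is not guaranteed to match the training script's implementation term for term. The function names are hypothetical, and `cooldown_steps` would correspond to the number of cooldown epochs multiplied by the steps per epoch.

```python
import math


def warmup_lr(base_lr, warmup_steps, step):
    """Linear warmup from a small value up to base_lr."""
    return base_lr * (step + 1) / warmup_steps


def cosine_lr(base_lr, warmup_steps, total_steps, step):
    """Warmup, then cosine decay from base_lr to zero."""
    if step < warmup_steps:
        return warmup_lr(base_lr, warmup_steps, step)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr


def const_lr(base_lr, warmup_steps, step):
    """Warmup, then hold base_lr for the rest of training."""
    if step < warmup_steps:
        return warmup_lr(base_lr, warmup_steps, step)
    return base_lr


def const_cooldown_lr(base_lr, warmup_steps, total_steps, cooldown_steps, end_lr, step):
    """Warmup, hold base_lr, then decay linearly to end_lr over the final steps."""
    if step < warmup_steps:
        return warmup_lr(base_lr, warmup_steps, step)
    start = total_steps - cooldown_steps
    if step < start:
        return base_lr
    frac = (step - start) / cooldown_steps
    return (1.0 - frac) * (base_lr - end_lr) + end_lr


# Print the learning rate at a few points of a 10,000-step run.
for step in (0, 999, 5000, 9000, 10000):
    print(
        step,
        cosine_lr(1e-5, 1000, 10000, step),
        const_lr(1e-5, 1000, step),
        const_cooldown_lr(1e-4, 1000, 10000, 2000, 1e-6, step),
    )
```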
Fine-tuning Examples
Domain Adaptation: Medical Images
# Fine-tune ViT-B/32 on medical imaging dataset
torchrun --nproc_per_node 4 -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--train-data "/data/medical_images/train-{0000..0100}.tar" \
--train-num-samples 1000000 \
--dataset-type webdataset \
--batch-size 256 \
--precision amp \
--workers 6 \
--lr 5e-5 \
--warmup 2000 \
--epochs 10 \
--save-frequency 2 \
--imagenet-val /data/imagenet/val/ \
--report-to wandb \
--name "vit-b32-medical-finetuned"
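The `{0000..0100}` part of the `--train-data` path is brace notation for a range of shard files. The sketch below is a simplified stand-in for that expansion, handling only a single zero-padded numeric range, and is useful for checking how many shards a pattern covers. `expand_shards` is a hypothetical helper, not part of the data loader.

```python
import re

_RANGE_RE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(pattern):
    """Expand a single zero-padded {start..end} range into explicit shard paths."""
    match = _RANGE_RE.search(pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding, e.g. 0007
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        prefix + str(index).zfill(width) + suffix
        for index in range(int(start), int(end) + 1)
    ]


shards = expand_shards("/data/medical_images/train-{0000..0100}.tar")
print(len(shards), shards[0], shards[-1])
```

The range is inclusive on both ends, so this pattern names 101 shards; with `--train-num-samples 1000000` that works out to roughly 9,900 samples per shard.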
Small Dataset Fine-tuning
# Fine-tune on small dataset (100k samples)
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--train-data "/data/small_dataset.csv" \
--dataset-type csv \
--csv-img-key filepath \
--csv-caption-key title \
--batch-size 128 \
--precision amp \
--workers 4 \
--lr 1e-5 \
--warmup 500 \
--epochs 20 \
--lock-image \
--lock-image-unlocked-groups 1 \
--report-to tensorboard
Multilingual Fine-tuning
# Adapt to new language while keeping visual encoder
python -m open_clip_train.main \
--model ViT-L-14 \
--pretrained laion2b_s32b_b82k \
--lock-image \
--train-data "/data/chinese_captions.tar" \
--train-num-samples 50000000 \
--dataset-type webdataset \
--batch-size 256 \
--lr 1e-4 \
--warmup 5000 \
--epochs 15 \
--precision amp \
--workers 8
High-Resolution Fine-tuning
# Fine-tune at higher resolution (336px instead of 224px)
python -m open_clip_train.main \
--model ViT-L-14 \
--pretrained laion2b_s32b_b82k \
--force-image-size 336 \
--train-data "/data/high_res.tar" \
--batch-size 128 \
--precision amp \
--grad-checkpointing \
--lr 1e-5 \
--epochs 5
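The smaller batch size and gradient checkpointing in this command follow from how image resolution affects sequence length in a ViT. As a rough sketch, assuming square images split into non-overlapping 14-pixel patches (the patch size implied by the ViT-L-14 name), the number of patch tokens per image can be computed as follows. `num_patches` is a hypothetical helper.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


base = num_patches(224, 14)      # 16 x 16 patches
high_res = num_patches(336, 14)  # 24 x 24 patches
print(base, high_res, high_res / base)  # 256 576 2.25
```

Going from 224px to 336px multiplies the number of patch tokens by 2.25, and activation memory grows at least proportionally, which is why a lower batch size is a sensible starting point.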
WiSE-FT: Robust Fine-tuning
For robust fine-tuning that maintains performance under distribution shift, use the WiSE-FT repository.
WiSE-FT (Weight-Space Ensembling for Fine-Tuning) averages, or more generally linearly interpolates between, the weights of:
Zero-shot pretrained model
Fine-tuned model
This preserves robustness while improving accuracy.
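The weight averaging itself is simple to express. Here is a minimal sketch, assuming both models share exactly the same parameter names and shapes; the mixing coefficient `alpha` and the helper name `interpolate_weights` are illustrative, and plain floats stand in for tensors so the example runs without any dependencies. See the WiSE-FT repository for the actual implementation.

```python
def interpolate_weights(zeroshot, finetuned, alpha):
    """Return (1 - alpha) * zeroshot + alpha * finetuned for every parameter.

    Both arguments are dicts mapping parameter names to values that support
    scalar multiplication and addition (floats here; tensors also work).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if zeroshot.keys() != finetuned.keys():
        raise KeyError("both state dicts must have identical parameter names")
    return {
        name: (1.0 - alpha) * zeroshot[name] + alpha * finetuned[name]
        for name in zeroshot
    }


zeroshot = {"proj.weight": 1.0, "proj.bias": 0.0}
finetuned = {"proj.weight": 3.0, "proj.bias": 2.0}
print(interpolate_weights(zeroshot, finetuned, alpha=0.5))
# {'proj.weight': 2.0, 'proj.bias': 1.0}
```

Setting `alpha=0` recovers the zero-shot model and `alpha=1` the fine-tuned model; values in between trade target-task accuracy against robustness, so it is worth evaluating several.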
WiSE-FT Workflow
# 1. Fine-tune on target dataset
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--train-data "/data/imagenet/train.csv" \
--lr 1e-5 \
--epochs 10 \
--name "imagenet-finetuned"
# 2. Use WiSE-FT to ensemble weights
# See https://github.com/mlfoundations/wise-ft for details
Reference: Robust Fine-tuning of Zero-shot Models
Monitoring Fine-tuning
Zero-shot Evaluation
Track zero-shot performance during fine-tuning:
python -m open_clip_train.main \
--pretrained laion2b_s34b_b79k \
--imagenet-val /data/imagenet/val/ \
--zeroshot-frequency 1 \
# ... other arguments
Monitor both:
Fine-tuning dataset performance (improves)
Zero-shot ImageNet accuracy (may degrade if overfitting)
Validation Loss
python -m open_clip_train.main \
--train-data "/data/train.tar" \
--val-data "/data/val.tar" \
--val-frequency 1 \
# ... other arguments
Weights & Biases Logging
python -m open_clip_train.main \
--report-to wandb \
--wandb-project-name "clip-finetuning" \
--wandb-notes "Fine-tuning ViT-B/32 on medical images" \
# ... other arguments
Common Fine-tuning Issues
Overfitting
Symptoms:
Training loss decreases, validation loss increases
Zero-shot performance degrades significantly
Solutions:
Reduce learning rate
Use fewer epochs
Freeze more layers
Add regularization (increase --wd)
Use more data augmentation
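The first symptom can be checked automatically from logged per-epoch losses. The sketch below is one simple heuristic, not a feature of the training script: it flags overfitting when validation loss has risen for `patience` consecutive epochs while training loss kept falling. The function name and the `patience` default are assumptions.

```python
def is_overfitting(train_losses, val_losses, patience=2):
    """Flag overfitting from per-epoch losses.

    Returns True when, over the last `patience` epoch-to-epoch transitions,
    training loss strictly decreased and validation loss strictly increased.
    """
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have the same length")
    if len(val_losses) <= patience:
        return False  # not enough history yet
    recent_train = train_losses[-(patience + 1):]
    recent_val = val_losses[-(patience + 1):]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising


print(is_overfitting([2.0, 1.5, 1.2, 1.0], [2.1, 1.8, 1.9, 2.0]))  # True
print(is_overfitting([2.0, 1.5, 1.2], [2.1, 1.8, 1.7]))            # False
```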
Underfitting
Symptoms:
Both training and validation loss remain high
No improvement over pretrained model
Solutions:
Increase learning rate
Train for more epochs
Unfreeze more layers
Reduce regularization
Catastrophic Forgetting
Symptoms:
Good performance on fine-tuning dataset
Poor zero-shot performance on general tasks
Solutions:
Use lower learning rate
Freeze early layers
Use WiSE-FT weight ensembling
Mix fine-tuning data with general data
Best Practices
Fine-tuning checklist:
✅ Start with a pretrained model close to your domain
✅ Use 10-100× lower learning rate than pretraining
✅ Fine-tune for 5-20 epochs (much less than pretraining)
✅ Monitor both task performance and zero-shot performance
✅ Try partial fine-tuning before full fine-tuning
✅ Use validation set to prevent overfitting
✅ Consider WiSE-FT for robust fine-tuning
✅ Save checkpoints frequently for comparison
Avoid:
❌ Using same learning rate as pretraining
❌ Fine-tuning for too many epochs
❌ Ignoring zero-shot performance degradation
❌ Not using validation data
❌ Forgetting to set --pretrained flag
Fine-tuning Templates
Quick Fine-tuning (Small Dataset)
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--train-data "/data/small.csv" \
--dataset-type csv \
--lr 1e-5 \
--epochs 10 \
--batch-size 128
Production Fine-tuning (Large Dataset)
torchrun --nproc_per_node 8 -m open_clip_train.main \
--model ViT-L-14 \
--pretrained laion2b_s32b_b82k \
--train-data "/data/train-{0000..9999}.tar" \
--train-num-samples 100000000 \
--val-data "/data/val-{0000..0099}.tar" \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 128 \
--precision amp \
--grad-checkpointing \
--workers 8 \
--lr 5e-5 \
--warmup 5000 \
--epochs 10 \
--save-frequency 1 \
--imagenet-val /data/imagenet/val/ \
--zeroshot-frequency 1 \
--local-loss \
--gather-with-grad \
--report-to wandb \
--name "production-finetune"
Conservative Fine-tuning (Preserve Generalization)
python -m open_clip_train.main \
--model ViT-B-32 \
--pretrained laion2b_s34b_b79k \
--lock-image \
--lock-image-unlocked-groups 1 \
--train-data "/data/domain.tar" \
--lr 1e-5 \
--warmup 2000 \
--epochs 5 \
--imagenet-val /data/imagenet/val/ \
--zeroshot-frequency 1
Next Steps
Training Overview: Learn about training CLIP models from scratch
Configuration: Explore all fine-tuning parameters
Pretrained Models: Browse available pretrained models
WiSE-FT: Learn about robust fine-tuning with weight ensembling